The Effect of Smoking on Vitamin B12 Deficiency
The Effect of Smoking on Vitamin B12 Deficiency
- 0.1 Preliminaries
- 1 Task 1: Data Source
- 2 Task 2: Load and Tidy the Data
- 3 Task 3: Listing of My Tibble
- 4 Task 4: Code Book
- 5 Task 5: My Subjects
- 6 Task 6: My Variables
- 7 Task 7: My Planned Linear Regression Model
- 8 Task 8: My Planned Logistic Regression Model
- 9 Task 9: Affirmation
- 10 Task 10: An Analysis: Linear Regression
- 10.1 Exploratory data analysis
- 10.2 Model A - kitchen sink with and without nonlinear term
- 10.3 Running “Best Subsets” to select predictors
- 10.4 Comparing model from best subsets to model A with RCS
- 10.5 Validation and calibration of final model
- 10.6 Final Model Summary
- 10.7 Predictions using final model
- 11 Task 11: An Analysis: Logistic Regression
- 11.1 Exploratory Data Analysis
- 11.2 Model 1 - evaluating nonlinear terms
- 11.3 Model 2 - Kitchen sink model with main effects only, evaluate variable selection with stepwise regression
- 11.4 Comparing kitchen sink model to model 3 (smoking, supplementary and dietary B12)
- 11.5 Validating and Calibrating Final Model (model 3)
- 11.6 Summarizing Final Model
- 11.7 Making Predictions - Nomogram
- 11.8 Making predictions - Graphs
- 12 Task 12: Discussion/Conclusions
0.1 Preliminaries
library(knitr); library(rmdformats)
## Global options
opts_chunk$set(comment = NA,
warning = FALSE,
message = FALSE)
#opts_knit$set(width=75)library(arm)
library(leaps)
library(tableone)
library(pander)
library(ROCR)
library(pROC)
library(skimr)
library(modelr)
library(forcats)
library(car)
library(rms)
library(broom)
library(tidyverse)1 Task 1: Data Source
For Project 1, I will be using NHANES 2013-14 data, which is available at https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2013. I will be utilizing several data files from NHANES, including demographics (DEMO_H), Vitamin B12 laboratory measures (VITB12_H), BMI (BMX_H), Alcohol intake (ALQ_H), Smoking (SMQ_H), Supplements (DSQTOT_H), and Dietary intakes (DR1TOT_H, DR2TOT_H). I plan to investigate the effect of smoking on vitamin B12 level and the presence of B12 deficiency, adjusting for other known risk factors for vitamin B12 deficiency, in adults greater than the age of 40 years.
NHANES was intended to be a survey of the general health and nutrition of the US population. Per the NHANES documentation (https://wwwn.cdc.gov/nchs/data/series/sr02_162.pdf), sampling methods utilized a complex, multistage probability sampling design in order to identify a population representative of the civilian, non-institutionalized US population. Upon agreement to participate in the study, study staff performed in-home screening, interviews, laboratory collection, and physical examination.
Due to sample size requirements, I chose to use a complete case analysis for the variables of interest in the NHANES 2013-2014 dataset, in adults greater than 40 years of age, with n = 1000.
2 Task 2: Load and Tidy the Data
2.1 Data Load
The NHANES data is contained in multiple csv files, which are imported and merged into a single dataframe (nhanes01) in the chunk below.
# Loading in individual csv files from NHANES website
demographics <- read.csv(file="DEMO_H.csv",na.strings=c(" .","9"))
tbl_df(demographics)# A tibble: 10,175 x 3
SEQN RIAGENDR RIDAGEYR
<int> <int> <int>
1 73557 1 69
2 73558 1 54
3 73559 1 72
4 73560 1 NA
5 73561 2 73
6 73562 1 56
7 73563 1 0
8 73564 2 61
9 73565 1 42
10 73566 2 56
# ... with 10,165 more rows
bmi <- read.csv("BMX_H.csv",na.strings=" .")
tbl_df(bmi)# A tibble: 9,813 x 2
SEQN BMXBMI
<int> <dbl>
1 73557 26.7
2 73558 28.6
3 73559 28.9
4 73560 17.1
5 73561 19.7
6 73562 41.7
7 73563 NA
8 73564 35.7
9 73566 26.5
10 73567 22.0
# ... with 9,803 more rows
smk <- read.csv("SMQ_H.csv", na.strings=c(" .","9"))
tbl_df(smk)# A tibble: 7,168 x 2
SEQN SMQ020
<int> <int>
1 73557 1
2 73558 1
3 73559 1
4 73561 2
5 73562 1
6 73564 2
7 73565 1
8 73566 1
9 73567 1
10 73568 2
# ... with 7,158 more rows
etoh <- read.csv("ALQ_H.csv", na.strings=c(" .",""))
tbl_df(etoh)# A tibble: 5,923 x 2
SEQN ALQ101
<int> <int>
1 73557 1
2 73558 1
3 73559 1
4 73561 1
5 73562 1
6 73564 2
7 73566 1
8 73567 1
9 73568 1
10 73571 2
# ... with 5,913 more rows
dr1tot <- read.csv("DR1TOT_H.csv", na.strings=c(" ."))
tbl_df(dr1tot)# A tibble: 9,813 x 2
SEQN DR1TVB12
<int> <dbl>
1 73557 2.79
2 73558 21.4
3 73559 3.78
4 73560 8.76
5 73561 8.30
6 73562 1.68
7 73563 NA
8 73564 3.30
9 73566 2.73
10 73567 0.790
# ... with 9,803 more rows
dr2tot <- read.csv("DR2TOT_H.csv", na.strings=c(" .",""))
tbl_df(dr2tot)# A tibble: 9,813 x 2
SEQN DR2TVB12
<int> <dbl>
1 73557 2.62
2 73558 4.24
3 73559 4.96
4 73560 3.49
5 73561 9.11
6 73562 NA
7 73563 NA
8 73564 13.9
9 73566 NA
10 73567 NA
# ... with 9,803 more rows
dsqtot <- read.csv("DSQTOT_H.csv", na.strings=c(" .",""))
tbl_df(dsqtot)# A tibble: 10,175 x 3
SEQN DSD010AN DSQTVB12
<int> <int> <dbl>
1 73557 2 NA
2 73558 2 NA
3 73559 2 18.0
4 73560 2 NA
5 73561 2 NA
6 73562 2 NA
7 73563 2 NA
8 73564 1 40.0
9 73565 2 NA
10 73566 1 6.00
# ... with 10,165 more rows
b12 <- read.csv("VITB12_H.csv", na.strings=c(" .",""))
tbl_df(b12)# A tibble: 5,736 x 2
SEQN LBDB12
<int> <int>
1 73557 524
2 73558 507
3 73559 732
4 73561 225
5 73562 750
6 73564 668
7 73566 378
8 73567 194
9 73568 528
10 73571 421
# ... with 5,726 more rows
# Merging into one dataframe
m1 <- merge(demographics,bmi,by="SEQN")
m2 <- merge(m1,smk,by="SEQN")
m3 <- merge(m2,etoh,by="SEQN")
m4 <- merge(m3,dr1tot,by="SEQN")
m5 <- merge(m4,dr2tot,by="SEQN")
m6 <- merge(m5,dsqtot,by="SEQN")
m7 <- merge(m6,b12,by="SEQN")
str(m7)'data.frame': 5735 obs. of 11 variables:
$ SEQN : int 73557 73558 73559 73561 73562 73564 73566 73567 73568 73571 ...
$ RIAGENDR: int 1 1 1 2 1 2 2 1 2 1 ...
$ RIDAGEYR: int 69 54 72 73 56 61 56 65 26 76 ...
$ BMXBMI : num 26.7 28.6 28.9 19.7 41.7 35.7 26.5 22 20.3 34.4 ...
$ SMQ020 : int 1 1 1 2 1 2 1 1 2 2 ...
$ ALQ101 : int 1 1 1 1 1 2 1 1 1 2 ...
$ DR1TVB12: num 2.79 21.45 3.78 8.3 1.68 ...
$ DR2TVB12: num 2.62 4.24 4.96 9.11 NA 13.9 NA NA 7.65 5.61 ...
$ DSD010AN: int 2 2 2 2 2 1 1 1 2 2 ...
$ DSQTVB12: num NA NA 18 NA NA 40 6 NA NA NA ...
$ LBDB12 : int 524 507 732 225 750 668 378 194 528 421 ...
summary(m7) SEQN RIAGENDR RIDAGEYR BMXBMI
Min. :73557 Min. :1.000 Min. :19.00 Min. :14.10
1st Qu.:76160 1st Qu.:1.000 1st Qu.:33.00 1st Qu.:24.00
Median :78712 Median :2.000 Median :48.00 Median :27.80
Mean :78672 Mean :1.524 Mean :48.37 Mean :29.04
3rd Qu.:81166 3rd Qu.:2.000 3rd Qu.:63.00 3rd Qu.:32.40
Max. :83727 Max. :2.000 Max. :80.00 Max. :82.90
NA's :75
SMQ020 ALQ101 DR1TVB12 DR2TVB12
Min. :1.00 Min. :1.000 Min. : 0.000 Min. : 0.000
1st Qu.:1.00 1st Qu.:1.000 1st Qu.: 1.975 1st Qu.: 1.900
Median :2.00 Median :1.000 Median : 3.580 Median : 3.530
Mean :1.57 Mean :1.299 Mean : 4.871 Mean : 4.775
3rd Qu.:2.00 3rd Qu.:2.000 3rd Qu.: 6.030 3rd Qu.: 5.960
Max. :2.00 Max. :9.000 Max. :94.770 Max. :107.040
NA's :2 NA's :493 NA's :560 NA's :1141
DSD010AN DSQTVB12 LBDB12
Min. :1.000 Min. : 0.16 Min. : 18.0
1st Qu.:2.000 1st Qu.: 6.00 1st Qu.: 381.0
Median :2.000 Median : 16.67 Median : 514.0
Mean :1.903 Mean : 220.77 Mean : 637.5
3rd Qu.:2.000 3rd Qu.: 50.00 3rd Qu.: 708.0
Max. :9.000 Max. :15100.00 Max. :26801.0
NA's :3909 NA's :289
nhanes01 <- m72.2 Tidying, Data Cleaning and Data Management
After importing the NHANES individual csv files and merging them to create my dataframe (initially called nhanes01), I further tidied the data by renaming continuous variables, renaming and releveling categorical variables, and creating an indicator variable for the binary outcome of interest (vitamin B12 deficiency), as detailed below:
ID: SEQN was renamed to id.
Age: RIDAGEYR was renamed to age.
Vitamin B12: LBDB12 was renamed to vit_b12, which is the subject’s vitamin B12 level in pg/ml.
Supplemental B12: DSQTVB12 was renamed to supp_b12.
Dietary B12: The mean of dietary intake of B12 on days 1 and 2 of evaluation, DR1TVB12 and DR2TVB12 (respectively), was calculated to create the variable diet_b12.
BMI: BMXBMI was renamed to bmi, then BMI categories (bmi_cat) were created for usage as a multicategorical predictor (Underweight (BMI<18.5), Normal (BMI 18.5-25.0), Overweight (BMI 25.0-30.0), and Obese (BMI>30.0) per CDC definitions of each category. The final BMI variable for my dataset is called bmi_cat.
Smoking: SMQ020, which was originally coded as 1 = Yes and 2 = No, was recoded as a binary indicator variable for whether a subject had smoked 100 cigarettes in his or her life (1 = Yes, 0 = no). This new variable is called smk100.
Alcohol: ALQ101 (whether a subject has consumed at least 12 alcoholic beverages in the past year) was renamed to etoh. It was originally coded as 1 = Yes and 2 = No, and was therefore recoded as a binary indicator variable (1 = Yes, 0 = No). Values in NHANES data of 9 indicated “Don’t know”, so these values were converted to NA.
Antacid use: DSD010AN in the original data, which asks the subject if he/she has used antacids in the past 30 days (with original coding 1 = Yes, 2 = No, 7 = Don’t know, and 9 = Refused), was recoded to a binary indicator variable where 1 = Yes, 0 = No, and the subjects who responded “Don’t know” or “refused” was coded as NA.
Vitamin B12 deficiency: I created a binary variable to indicate whether a subject had vitamin B12 deficiency, if vitamin B12 level (vit_b12) was <300.
## Renaming continuous variables
# ID
nhanes01$id_int <- nhanes01$SEQN
nhanes01$id <- as.character(nhanes01$id_int)
Hmisc::describe(nhanes01$id)nhanes01$id
n missing distinct
5735 0 5735
lowest : 73557 73558 73559 73561 73562, highest: 83721 83723 83724 83726 83727
# BMI
nhanes01$bmi <- nhanes01$BMXBMI
Hmisc::describe(nhanes01$bmi)nhanes01$bmi
n missing distinct Info Mean Gmd .05 .10
5660 75 400 1 29.04 7.651 19.90 21.30
.25 .50 .75 .90 .95
24.00 27.80 32.40 38.21 42.60
lowest : 14.1 14.2 15.2 15.4 16.1, highest: 70.1 71.5 74.1 77.5 82.9
# Age
nhanes01$age <- nhanes01$RIDAGEYR
Hmisc::describe(nhanes01$age)nhanes01$age
n missing distinct Info Mean Gmd .05 .10
5735 0 62 1 48.37 20.67 21 24
.25 .50 .75 .90 .95
33 48 63 74 80
lowest : 19 20 21 22 23, highest: 76 77 78 79 80
# Vitamin B12
nhanes01$vit_b12 <- nhanes01$LBDB12
Hmisc::describe(nhanes01$vit_b12)nhanes01$vit_b12
n missing distinct Info Mean Gmd .05 .10
5446 289 1066 1 637.5 415.5 248.2 293.0
.25 .50 .75 .90 .95
381.0 514.0 708.0 979.5 1259.0
lowest : 18 66 70 78 86, highest: 12494 12600 14900 15700 26801
# Supplemental B12
nhanes01$supp_b12 <- nhanes01$DSQTVB12
Hmisc::describe(nhanes01$supp_b12)nhanes01$supp_b12
n missing distinct Info Mean Gmd .05 .10
1826 3909 308 0.995 220.8 385.2 1.33 3.00
.25 .50 .75 .90 .95
6.00 16.67 50.00 603.75 1012.00
Value 0 200 400 600 800 1000 1200 1400 1600 1800
Frequency 1504 88 31 23 9 98 6 4 4 2
Proportion 0.824 0.048 0.017 0.013 0.005 0.054 0.003 0.002 0.002 0.001
Value 2000 2200 2400 2600 4000 4200 5000 6800 10000 15200
Frequency 12 1 15 12 1 1 10 1 3 1
Proportion 0.007 0.001 0.008 0.007 0.001 0.001 0.005 0.001 0.002 0.001
# Dietary B12
nhanes01$diet_b12 <- ((nhanes01$DR1TVB12+nhanes01$DR2TVB12)/2)
Hmisc::describe(nhanes01$diet_b12)nhanes01$diet_b12
n missing distinct Info Mean Gmd .05 .10
4594 1141 2286 1 4.797 3.623 1.088 1.480
.25 .50 .75 .90 .95
2.445 3.845 5.940 8.867 11.620
lowest : 0.010 0.015 0.025 0.055 0.080, highest: 39.920 40.830 48.370 52.355 54.415
## Managing categorical variables
# BMI
nhanes01$bmi_cat <- Hmisc::cut2(nhanes01$bmi, cuts = c(18.5, 25, 30))
Hmisc::describe(nhanes01$bmi_cat)nhanes01$bmi_cat
n missing distinct
5660 75 4
Value [14.1,18.5) [18.5,25.0) [25.0,30.0) [30.0,82.9]
Frequency 103 1635 1801 2121
Proportion 0.018 0.289 0.318 0.375
nhanes01$bmi_cat2 <- nhanes01$bmi_cat
levels(nhanes01$bmi_cat2)[1] "[14.1,18.5)" "[18.5,25.0)" "[25.0,30.0)" "[30.0,82.9]"
nhanes01$bmi_cat2 <- fct_recode(nhanes01$bmi_cat,
"Underweight" = "[14.1,18.5)",
"Normal" = "[18.5,25.0)",
"Overweight" = "[25.0,30.0)",
"Obese" = "[30.0,82.9]")
table(nhanes01$bmi_cat, nhanes01$bmi_cat2)
Underweight Normal Overweight Obese
[14.1,18.5) 103 0 0 0
[18.5,25.0) 0 1635 0 0
[25.0,30.0) 0 0 1801 0
[30.0,82.9] 0 0 0 2121
nhanes01$bmi_cat <- nhanes01$bmi_cat2
# Smoking
nhanes01$smk100 <- ifelse(nhanes01$SMQ020==1, 1, 0)
table(nhanes01$smk100, nhanes01$SMQ020)
1 2
0 0 3266
1 2467 0
# Alcohol
nhanes01$etoh <- ifelse(nhanes01$ALQ101==1,1,ifelse(nhanes01$ALQ101==9,NA,0))
table(nhanes01$etoh, nhanes01$ALQ101)
1 2 9
0 0 1505 0
1 3729 0 0
# Antacid use
nhanes01$antacid <- ifelse(nhanes01$DSD010AN==1,1,ifelse(nhanes01$DSD010AN==7,NA, ifelse(nhanes01$DSD010AN==9,NA,0)))
table(nhanes01$antacid, nhanes01$DSD010AN)
1 2 7 9
0 0 5167 0 0
1 566 0 0 0
# Creating indicator variable for vitamin B12 deficiency (vit_b12<300)
nhanes01$b12def <- ifelse(nhanes01$vit_b12<300,1,0)2.2.1 Subset Columns
Given the sample size requirements, I selected complete cases for the variables of interest, and subsetted by age greater than 40, with the final n = 1000. I created a tidy csv file with the variables of interest for export, which is then imported for further analysis below.
# Selecting variables of interest
nhanes02 <- nhanes01 %>%
dplyr::select(id, age, vit_b12, b12def, smk100, bmi_cat, etoh, antacid, supp_b12, diet_b12)
# Selecting complete cases for sample size requirements
b12_final <- nhanes02 %>%
filter(complete.cases(.)) %>%
filter(age > 40)
tbl_df(b12_final)# A tibble: 1,000 x 10
id age vit_b12 b12def smk100 bmi_cat etoh antacid supp_b12
<chr> <int> <int> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 73559 72 732 0 1.00 Overweight 1.00 0 18.0
2 73564 61 668 0 0 Obese 0 1.00 40.0
3 73610 43 548 0 0 Overweight 1.00 0 18.0
4 73613 60 538 0 0 Obese 1.00 0 16.7
5 73616 62 318 0 1.00 Normal 1.00 0 16.7
6 73622 72 378 0 0 Overweight 1.00 0 4.00
7 73628 80 4560 0 1.00 Normal 1.00 1.00 2500
8 73642 57 1040 0 1.00 Obese 0 1.00 1010
9 73643 62 904 0 1.00 Overweight 1.00 0 25.0
10 73659 70 697 0 0 Obese 1.00 0 1000
# ... with 990 more rows, and 1 more variable: diet_b12 <dbl>
# Evaluating values of outcome in new dataset
Hmisc::describe(b12_final$vit_b12)b12_final$vit_b12
n missing distinct Info Mean Gmd .05 .10
1000 0 626 1 875.1 641.2 304.9 362.9
.25 .50 .75 .90 .95
481.0 669.0 933.2 1344.6 2100.2
lowest : 154 157 160 173 174, highest: 7580 7750 9317 12099 12494
table(b12_final$b12def)
0 1
952 48
# Write tidy csv file
write.csv(b12_final,"Baldassari-tidy.csv", row.names=FALSE,na="NA")# Read back in tidy csv file to start from tidy dataset
b12_final <- read.csv("Baldassari-tidy.csv",na.strings="NA")2.3 Are there missing values?
Below, the complete case analysis (lack of missing values) is confirmed.
na.pattern(b12_final)pattern
0000000000
1000
3 Task 3: Listing of My Tibble
glimpse(b12_final)Observations: 1,000
Variables: 10
$ id <int> 73559, 73564, 73610, 73613, 73616, 73622, 73628, 7364...
$ age <int> 72, 61, 43, 60, 62, 72, 80, 57, 62, 70, 59, 70, 51, 6...
$ vit_b12 <int> 732, 668, 548, 538, 318, 378, 4560, 1040, 904, 697, 5...
$ b12def <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ smk100 <int> 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,...
$ bmi_cat <fct> Overweight, Obese, Overweight, Obese, Normal, Overwei...
$ etoh <int> 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,...
$ antacid <int> 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,...
$ supp_b12 <dbl> 18.00, 40.00, 18.00, 16.67, 16.67, 4.00, 2500.00, 100...
$ diet_b12 <dbl> 4.370, 8.600, 3.690, 5.470, 2.305, 4.360, 3.530, 2.48...
str(b12_final)'data.frame': 1000 obs. of 10 variables:
$ id : int 73559 73564 73610 73613 73616 73622 73628 73642 73643 73659 ...
$ age : int 72 61 43 60 62 72 80 57 62 70 ...
$ vit_b12 : int 732 668 548 538 318 378 4560 1040 904 697 ...
$ b12def : int 0 0 0 0 0 0 0 0 0 0 ...
$ smk100 : int 1 0 0 0 1 0 1 1 1 0 ...
$ bmi_cat : Factor w/ 4 levels "Normal","Obese",..: 3 2 3 2 1 3 1 2 3 2 ...
$ etoh : int 1 0 1 1 1 1 1 0 1 1 ...
$ antacid : int 0 1 0 0 0 0 1 1 0 0 ...
$ supp_b12: num 18 40 18 16.7 16.7 ...
$ diet_b12: num 4.37 8.6 3.69 5.47 2.31 ...
My tibble has 1000 subject/rows and 10 variables/columns.
4 Task 4: Code Book
Below is the codebook for my Project 1.
attach(b12_final)| Variable | Class | Description | Range or Levels | NA |
|---|---|---|---|---|
id |
integer | patient identification code | Range: 73559, 83721 | - |
age |
integer | age (years) | Range: 41, 80 | - |
vit_b12 |
integer | Vitamin B12 level (pg/ml) | Range: 154, 12494 | - |
b12def |
integer | B12 deficiency (B12 level <300 pg/ml) (1 = yes, 0 = no) | 48 (4.8%) b12def | - |
smk100 |
integer | Smoked 100 cigarettes in lifetime (1 = yes, 0 = no) | 450 (45%) smk100 | - |
bmi_cat |
factor | BMI category (Underweight, Normal, Overweight, or Obese) | 13 (1.3%) Underweight, 254 (25.4%) Normal, 361 (36.1%) Overweight, 372 (37.2%) Obese | - |
etoh |
integer | At least 12 alcoholic beverages in the past year (1 = yes, 0 = no) | NA (NA%) etoh | - |
antacid |
integer | Antacid use in the past 30 days (1 = yes, 0 = no) | 142 (14.2%) antacid | - |
supp_b12 |
numeric | Supplemental Vitamin B12 level intake (mcg) | Range: 0.16, 1.5110^{4} | - |
diet_b12 |
numeric | Dietary Vitamin B12 level intake (mcg) | Range: 0.16, 54.415 | - |
detach(b12_final)As this was a complete case analysis, there are no NAs.
5 Task 5: My Subjects
The data include a sample of NHANES 2013-2014 data with 1000 subjects greater than the age of 40 who had complete information on the predictors and outcome of interest (vitamin B12 level).
6 Task 6: My Variables
There are 10 variables in the b12_final data set.
- id
- This is a subject identification code, ranging from 73564 to 83721. It is a renamed version of SEQN from the original NHANES data.
- age
- This is the patient’s age in years. Age is the renamed version of RIDAGEYR from the Demographics table of NHANES, and represents the subject’s age in years. All subjects in this dataset are greater than 40 years of age.
- vit_b12
- This is the patient’s serum vitamin B12 level, in pg/ml. vit_b12 is the renamed version of LBDB12 from the original NHANES data (B12 Table).
- b12def
- This is the binary indicator variable I created for Vitamin B12 deficiency, my binary outcome of interest. b12def = 1 if the patient has vitamin B12 deficiency (defined as vitamin B12 level <300), and is 0 if they do not.
- smk100
- This variable indicates whether a subject has smoked at least 100 cigarettes in their lifetime (1= Yes, 0 = No). It is the renamed version of SMQ020 from the Smoking Questionnaire from NHANES data.
- bmi_cat
- This variable indicates the subject’s BMI category, and can take 4 possible values: Underweight (BMI<18.5), Normal (BMI 18.5-25.0), Overweight (BMI 25.0-30.0), and Obese (BMI>30.0). It was derived from BMI data available via NHANES Body Measures Table, and the variable for BMI there was BMXBMI.
- etoh
- This is a binary indicator variable for whether a subject reported drinking at least 12 alcoholic beverages in the past year (1 = Yes, 0 = No).
- antacid
- This is a binary indicator variable for whether a subject reported taking an antacid medication in the past 30 days (1 = Yes, 0 = No).
- supp_b12
- This is the amount of supplementary vitamin B12 a patient took in the 30 days prior to evaluation, in mcg.
- diet_b12
- This is the daily mean amount of dietary vitamin B12 a patient ingested over the 2 days of dietary evaluation, in mcg.
7 Task 7: My Planned Linear Regression Model
For my planned linear regression model, I plan to use vitamin B12 level (pg/ml) as a quantitative variable, and my candidate predictors will be the following:
agesmk100bmi_catetohantacidsupp_b12diet_b12
The key predictor I will focus on here is smoking.
7.1 Spearman \(\rho^2\) Plot
A Spearman \(\rho^2\) plot indicates that supplemental B12 is the strongest predictor for vitamin B12 level, and alcohol and age are other variables of interest but with a much smaller predictive punch. Supplemental B12 is an important confounder of this relationship that will be important to account for in this analysis.
plot(spearman2(vit_b12 ~ age + smk100 + bmi_cat + etoh + antacid + supp_b12 + diet_b12, data = b12_final))8 Task 8: My Planned Logistic Regression Model
For my planned logistic regression model, my binary outcome will be vitamin B12 deficiency (defined as serum vitamin B12 level < 300), represented by the variable b12def. My candidate predictors will be the same as above:
agesmk100bmi_catetohantacidsupp_b12diet_b12
The key predictor I will focus on here is smoking.
8.1 Spearman \(\rho^2\) Plot
A Spearman \(\rho^2\) plot indicates that supplemental and dietary B12, as well as smoking, are the strongest predictors for vitamin B12 deficiency. Supplemental and dietary B12 are important confounders of this relationship that will be important to account for in this analysis.
plot(spearman2(b12def ~ age + smk100 + bmi_cat + etoh + antacid + supp_b12 + diet_b12, data = b12_final))9 Task 9: Affirmation
This data set meets all requirements specified in the project instructions.
- The data set contains 1000 observations on 10 variables, which is within the limits of 100-1000 observations on 7-20 variables set in the assignment.
- There are no missing values in the data.
- I am considering at least four predictors for each regression model, and I include at least one quantitative (for example,
diet_b12) and multi-categorical variable (for example,bmi_cat) in each model. - I (Laura Baldassari) am (is) certain that it is completely appropriate for these data to be shared with anyone, without any conditions. There are no concerns about privacy or security, mostly because the data have been on a public website for many years, and are completely free of identifying information about individual subjects.
10 Task 10: An Analysis: Linear Regression
10.1 Exploratory data analysis
For further exploratory analysis, I utilized skim and plotted a histogram, Boxplot, and normal QQ plot of the outcome (vitamin B12 level). These analyses revealed that the outcome is extremely right skewed due to outliers. I therefore evaluated for possible transformation via BoxCox analysis.
skim(b12_final)Skim summary statistics
n obs: 1000
n variables: 10
Variable type: factor
variable missing complete n n_unique
bmi_cat 0 1000 1000 4
top_counts ordered
Obe: 372, Ove: 361, Nor: 254, Und: 13 FALSE
Variable type: integer
variable missing complete n mean sd p0 p25 median
age 0 1000 1000 61.01 11.87 41 51 61
antacid 0 1000 1000 0.14 0.35 0 0 0
b12def 0 1000 1000 0.048 0.21 0 0 0
etoh 0 1000 1000 0.74 0.44 0 0 1
id 0 1000 1000 78566.26 2937.18 73559 75995.25 78582
smk100 0 1000 1000 0.45 0.5 0 0 0
vit_b12 0 1000 1000 875.15 943.18 154 481 669
p75 p100 hist
70 80 ▆▅▆▇▇▆▅▇
0 1 ▇▁▁▁▁▁▁▁
0 1 ▇▁▁▁▁▁▁▁
1 1 ▃▁▁▁▁▁▁▇
81011.5 83721 ▇▇▇▇▇▇▇▇
1 1 ▇▁▁▁▁▁▁▆
933.25 12494 ▇▁▁▁▁▁▁▁
Variable type: numeric
variable missing complete n mean sd p0 p25 median p75
diet_b12 0 1000 1000 4.76 4.11 0.16 2.49 3.8 5.86
supp_b12 0 1000 1000 256.76 811.31 0.16 6 25 75
p100 hist
54.41 ▇▂▁▁▁▁▁▁
15100 ▇▁▁▁▁▁▁▁
# Histogram of B12 level
ggplot(b12_final, aes(x= vit_b12)) +
geom_histogram(aes(y=..density..)) +
stat_function(fun=dnorm,
args=list(mean=mean(b12_final$vit_b12),
sd=sd(b12_final$vit_b12)),
lwd = 1.5) +
labs(title = "Histogram of B12 level", x="Vitamin B12 level", y="") +
theme_bw()# Boxplot of B12 level
ggplot(b12_final, aes(x = 1, y = vit_b12)) +
geom_boxplot(notch = TRUE) +
labs(title = "Boxplot", x = "", y = "") +
theme_bw() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank()) # Normal QQ plot of B12 level
qqnorm(b12_final$vit_b12, main = "Normal Q-Q plot for vitamin B12")
qqline(b12_final$vit_b12, col = "red")# Box-Cox for B12 level
boxCox(lm(vit_b12 ~ diet_b12 + supp_b12 + bmi_cat + etoh + smk100 + age + antacid, data=b12_final))powerTransform(lm(vit_b12 ~ diet_b12 + supp_b12 + bmi_cat + etoh + smk100 + age + antacid, data=b12_final))Estimated transformation parameters
Y1
-0.3007455
The BoxCox analysis estimates a transformation parameter of -0.3, which is closest to -0.5. I will therefore transform the outcome (vitamin B12 level) per Tukey’s ladder to 1/sqrt(y).
# Transforming B12 to 1/sqrt(B12)
b12_final$b12_trans <- (1/(sqrt(b12_final$vit_b12)))
ggplot(b12_final, aes(x=vit_b12, y = b12_trans)) + geom_point()# Histogram of transformed B12 level
ggplot(b12_final, aes(x= b12_trans)) +
geom_histogram(aes(y=..density..)) +
stat_function(fun=dnorm,
args=list(mean=mean(b12_final$b12_trans),
sd=sd(b12_final$b12_trans)),
lwd = 1.5) +
labs(title = "Histogram of transformed B12 level", x="1/sqrt(Vitamin B12 level)", y="") + theme_bw()# Boxplot of B12 level
ggplot(b12_final, aes(x = 1, y = b12_trans)) +
geom_boxplot(notch = TRUE) +
labs(title = "Boxplot", x = "", y = "") +
theme_bw() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank()) # Normal QQ plot of B12 level
qqnorm(b12_final$b12_trans, main = "Normal Q-Q plot for transformed vitamin B12")
qqline(b12_final$b12_trans, col = "red")The transformed outcome now appears to be approximately normally distributed on histogram, boxplot, and normal QQ plot, which is a marked improvement from its untransformed values.
After transforming the outcome, I reevaluated a Spearman \(\rho^2\) plot to determine which predictors were most promising for nonlinear terms. The plot again indicates that supplemental B12 is the strongest predictor for vitamin B12 level, and alcohol and age are other variables of interest but with a much smaller predictive punch. Therefore, I will use supplemental B12, alcohol, and age as considerations for a nonlinear term. Supplemental B12 is an important confounder that will be important to account for in this analysis.
plot(spearman2(b12_trans ~ age + smk100 + bmi_cat + etoh + antacid + supp_b12 + diet_b12, data = b12_final))For this analysis, the approximate number of degrees of freedom is n/15 = 1000/15 = 66. Therefore, I will spend 3 degrees of freedom evaluating supplementary B12, alcohol, and age for possible non-linearity via scatterplots versus vitamin B12 level.
ggplot(data=b12_final, aes(x= supp_b12, y = b12_trans)) +
geom_point() +
geom_smooth(method= "lm", se= FALSE) +
geom_smooth(method= "loess", se=FALSE)ggplot(data=b12_final, aes(x= etoh, y = b12_trans)) +
geom_point()+
geom_smooth(method= "lm", se= FALSE)+
geom_smooth(method= "loess", se=FALSE)ggplot(data=b12_final, aes(x= age, y = b12_trans)) +
geom_point()+
geom_smooth(method= "lm", se= FALSE)+
geom_smooth(method= "loess", se=FALSE)10.2 Model A - kitchen sink with and without nonlinear term
Given the scatterplots above, I created a modified kitchen sink model with a restricted cubic spline with 4 knots for supplemental B12 intake. It does not appear appropriate to try to fit a nonlinear term to alcohol and age.
modelA_rcs <- lm(b12_trans ~ rcs(supp_b12,4) + age + smk100 + bmi_cat + etoh + antacid + diet_b12, data=b12_final)
summary(modelA_rcs)
Call:
lm(formula = b12_trans ~ rcs(supp_b12, 4) + age + smk100 + bmi_cat +
etoh + antacid + diet_b12, data = b12_final)
Residuals:
Min 1Q Median 3Q Max
-0.033944 -0.006466 -0.000949 0.005646 0.040411
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.304e-02 1.922e-03 22.397 < 2e-16 ***
rcs(supp_b12, 4)supp_b12 -2.630e-04 6.538e-05 -4.022 6.21e-05 ***
rcs(supp_b12, 4)supp_b12' 2.495e-01 6.979e-02 3.575 0.000368 ***
rcs(supp_b12, 4)supp_b12'' -3.537e-01 9.906e-02 -3.571 0.000373 ***
age -1.003e-05 2.723e-05 -0.368 0.712664
smk100 -5.524e-04 6.597e-04 -0.837 0.402636
bmi_catObese 1.610e-03 8.064e-04 1.996 0.046198 *
bmi_catOverweight 8.171e-04 8.040e-04 1.016 0.309758
bmi_catUnderweight -1.487e-03 2.797e-03 -0.532 0.595032
etoh 2.492e-03 7.550e-04 3.301 0.000998 ***
antacid -6.535e-04 8.939e-04 -0.731 0.464918
diet_b12 -7.349e-05 7.578e-05 -0.970 0.332430
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.009799 on 988 degrees of freedom
Multiple R-squared: 0.1789, Adjusted R-squared: 0.1698
F-statistic: 19.57 on 11 and 988 DF, p-value: < 2.2e-16
anova(modelA_rcs)Analysis of Variance Table
Response: b12_trans
Df Sum Sq Mean Sq F value Pr(>F)
rcs(supp_b12, 4) 3 0.019122 0.0063741 66.3813 < 2.2e-16 ***
age 1 0.000095 0.0000953 0.9922 0.319456
smk100 1 0.000005 0.0000051 0.0534 0.817239
bmi_cat 3 0.000320 0.0001066 1.1097 0.344149
etoh 1 0.000985 0.0009846 10.2534 0.001408 **
antacid 1 0.000053 0.0000533 0.5547 0.456589
diet_b12 1 0.000090 0.0000903 0.9403 0.332430
Residuals 988 0.094870 0.0000960
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Create OLS model to create nomogram
d <- datadist(b12_final)
options(datadist="d")
modelA_rcs_ols <- ols(b12_trans ~ rcs(supp_b12,4) + age + smk100 + bmi_cat + etoh + antacid + diet_b12, data = b12_final,x = TRUE, y = TRUE)
plot(nomogram(modelA_rcs_ols))Based on the nonlinear model above’s nomogram, it is apparent that the restricted cubic spline supplemental B12 clearly has the strongest impact upon the transformed vitamin B12 level.
modelA <- lm(b12_trans ~ supp_b12 + age + smk100 + bmi_cat + etoh + antacid + diet_b12, data=b12_final)
summary(modelA)
Call:
lm(formula = b12_trans ~ supp_b12 + age + smk100 + bmi_cat +
etoh + antacid + diet_b12, data = b12_final)
Residuals:
Min 1Q Median 3Q Max
-0.031024 -0.006603 -0.001004 0.005807 0.043818
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.198e-02 1.973e-03 21.277 < 2e-16 ***
supp_b12 -4.037e-06 3.995e-07 -10.106 < 2e-16 ***
age -5.064e-05 2.764e-05 -1.832 0.06729 .
smk100 -9.431e-04 6.838e-04 -1.379 0.16814
bmi_catObese 1.177e-03 8.368e-04 1.406 0.15998
bmi_catOverweight 5.953e-04 8.355e-04 0.712 0.47633
bmi_catUnderweight -2.150e-03 2.906e-03 -0.740 0.45957
etoh 2.457e-03 7.839e-04 3.135 0.00177 **
antacid -6.838e-04 9.283e-04 -0.737 0.46148
diet_b12 -7.467e-05 7.868e-05 -0.949 0.34282
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01019 on 990 degrees of freedom
Multiple R-squared: 0.1104, Adjusted R-squared: 0.1023
F-statistic: 13.65 on 9 and 990 DF, p-value: < 2.2e-16
anova(modelA)Analysis of Variance Table
Response: b12_trans
Df Sum Sq Mean Sq F value Pr(>F)
supp_b12 1 0.010799 0.0107986 104.0122 < 2.2e-16 ***
age 1 0.000632 0.0006319 6.0868 0.013788 *
smk100 1 0.000017 0.0000173 0.1663 0.683494
bmi_cat 3 0.000204 0.0000681 0.6561 0.579199
etoh 1 0.000954 0.0009541 9.1902 0.002496 **
antacid 1 0.000059 0.0000589 0.5669 0.451667
diet_b12 1 0.000094 0.0000935 0.9007 0.342824
Residuals 990 0.102782 0.0001038
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the model with a restricted cubic spline (RCS), all knots for supplemental B12 RCS and alcohol appear to have statistically significant predictive value, per ANOVA. BMI category of obese also has some marginal statistically significant predictive value in the linear model, but BMI as a term does not with ANOVA testing.
However, without the restricted cubic spline, the terms for supplemental B12, age, and alcohol appear to have statistically significant predictive value.
Further model comparison was performed via ANOVA nested model comparison and glance to compare AIC/BIC.
# Nested model comparison by ANOVA
anova(modelA, modelA_rcs)Analysis of Variance Table
Model 1: b12_trans ~ supp_b12 + age + smk100 + bmi_cat + etoh + antacid +
diet_b12
Model 2: b12_trans ~ rcs(supp_b12, 4) + age + smk100 + bmi_cat + etoh +
antacid + diet_b12
Res.Df RSS Df Sum of Sq F Pr(>F)
1 990 0.10278
2 988 0.09487 2 0.0079119 41.198 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Glance
glance(modelA_rcs) r.squared adj.r.squared sigma statistic p.value df logLik
1 0.1789021 0.1697603 0.009799119 19.56972 6.025097e-36 12 3212.561
AIC BIC deviance df.residual
1 -6399.121 -6335.32 0.09487045 988
glance(modelA) r.squared adj.r.squared sigma statistic p.value df logLik
1 0.1104255 0.1023384 0.01018924 13.65462 7.822512e-21 10 3172.51
AIC BIC deviance df.residual
1 -6323.02 -6269.035 0.1027823 990
The ANOVA testing for nested models indicates that there is significant predictive value offered by the restricted cubic spline term, as p < 0.05 with the addition of the term to the model.
| Model | R2 | Adjusted R2 | AIC | BIC | degrees of freedom |
|---|---|---|---|---|---|
| A with RCS | 0.179 | 0.170 | -6399.1 | -6335.3 | 12 |
| A without RCS | 0.110 | 0.102 | -6323.0 | -6269.0 | 10 |
Based on its higher multiple/adjusted R and lower AIC/BIC, despite use of 2 additional degrees of freedom, the kitchen sink model with the restricted cubic spline outperforms the kitchen sink model without it.
I will proceed with best subsets analysis to further refine the search for the best predictors, keeping model A with a restricted cubic spline for later comparison.
10.3 Running “Best Subsets” to select predictors
Since smoking is my primary predictor of interest, I ran best subsets regression with forcing my smoking variable (smk100) into each model.
library(leaps)
preds<- with(b12_final, cbind(supp_b12, age, smk100, etoh, antacid, diet_b12))
x1 <- regsubsets(preds, y=b12_final$b12_trans, data=b12_final, force.in=c("smk100"))
rs <- summary(x1)
rsSubset selection object
6 Variables (and intercept)
Forced in Forced out
smk100 FALSE FALSE
supp_b12 FALSE FALSE
age TRUE FALSE
etoh FALSE FALSE
antacid FALSE FALSE
diet_b12 FALSE FALSE
1 subsets of each size up to 6
Selection Algorithm: exhaustive
smk100 supp_b12 age etoh antacid diet_b12
2 ( 1 ) "*" "*" " " " " " " " "
3 ( 1 ) "*" "*" " " "*" " " " "
4 ( 1 ) "*" "*" "*" "*" " " " "
5 ( 1 ) "*" "*" "*" "*" " " "*"
6 ( 1 ) "*" "*" "*" "*" "*" "*"
# Summarizing "winners"
winners <- tbl_df(rs$which)
winners$k <- 3:7
winners$r2 <- rs$rsq
winners$adjr2 <- rs$adjr2
winners$cp <- rs$cp
winners$bic <- rs$bic
winners# A tibble: 5 x 12
`(Intercept)` smk100 supp_b12 age etoh antacid diet_b12 k r2
<lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <int> <dbl>
1 T T T F F F F 3 0.0937
2 T T T F T F F 4 0.103
3 T T T T T F F 5 0.107
4 T T T T T F T 6 0.107
5 T T T T T T T 7 0.108
# ... with 3 more variables: adjr2 <dbl>, cp <dbl>, bic <dbl>
10.3.1 Building the Four Summary Plots
# Note: 1000 subjects, models for 1-5 predictors in addition to smoking (3-7 inputs)
# Adjusted R2 plot
p1 <- ggplot(winners, aes(x = k, y = adjr2,
label = round(adjr2,2))) +
geom_line() +
geom_label() +
geom_label(data = subset(winners,
adjr2 == max(adjr2)),
aes(x = k, y = adjr2, label = round(adjr2,2)),
fill = "yellow", col = "blue") +
theme_bw() +
scale_x_continuous(breaks = 3:7) +
labs(x = "# of predictors (including intercept)",
y = "Adjusted R-squared")
# Cp plot
p2 <- ggplot(winners, aes(x = k, y = cp,
label = round(cp,1))) +
geom_line() +
geom_label() +
geom_abline(intercept = 0, slope = 1,
col = "red") +
theme_bw() +
scale_x_continuous(breaks = 3:7) +
labs(x = "# of predictors (including intercept)",
y = "Mallows' Cp")
# Calculating AIC.corr
temp.n <- nrow(b12_final)
temp.inputs <- 6
rs$aic.corr <- temp.n*log(rs$rss / temp.n) +
2*(2:temp.inputs) +
(2 * (2:temp.inputs) * ((2:temp.inputs)+1) /
(temp.n - (2:temp.inputs) - 1))
winners$aic.corr <- rs$aic.corr
# Plot AIC.corr
p3 <- ggplot(winners, aes(x = k, y = rs$aic.corr,
label = round(aic.corr,1))) +
geom_line() +
geom_label() +
geom_label(data = subset(winners,
aic.corr == min(aic.corr)),
aes(x = k, y = aic.corr),
fill = "pink", col = "red") +
theme_bw() +
scale_x_continuous(breaks = 3:7) +
labs(x = "# of predictors (including intercept)",
y = "Bias-Corrected AIC")
# BIC plot - BIC is minimized by k = 2
p4 <- ggplot(winners, aes(x = k, y = bic,
label = round(bic,1))) +
geom_line() +
geom_label() +
geom_label(data = subset(winners, bic == min(bic)),
aes(x = k, y = bic),
fill = "lightgreen", col = "blue") +
theme_bw() +
scale_x_continuous(breaks = 3:7) +
labs(x = "# of predictors (including intercept)",
y = "BIC")
# All four plots together
gridExtra::grid.arrange(p1, p2, p3, p4, nrow = 2)10.3.2 Candidate Models from “Best Subsets”
| Tool | Inputs | Predictors |
|---|---|---|
| BIC | 4 | smk100, supp_b12, etoh |
| AICc, Cp, R2 (adj.) | 5 | smk100, supp_b12, etoh, age |
10.3.3 Comparison of Candidate Models from Best Subsets via ANOVA
lm4 <- lm(b12_trans ~ smk100 + supp_b12 + etoh, data=b12_final)
lm5 <- lm(b12_trans ~ smk100 + supp_b12 + etoh + age, data=b12_final)anova(lm4,lm5)Analysis of Variance Table
Model 1: b12_trans ~ smk100 + supp_b12 + etoh
Model 2: b12_trans ~ smk100 + supp_b12 + etoh + age
Res.Df RSS Df Sum of Sq F Pr(>F)
1 996 0.10363
2 995 0.10322 1 0.00041074 3.9594 0.04688 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
10.4 Comparing model from best subsets to model A with RCS
Based on this ANOVA result where it appears that age results in a (barely) statistically significant increase in predictive value, I will use the linear model with 5 inputs as model B, to compare to model A above with the restricted cubic spline for supplementary B12 intake.
modelB <- lm(b12_trans ~ smk100 + supp_b12 + etoh + age, data=b12_final)
glance(modelB) r.squared adj.r.squared sigma statistic p.value df logLik
1 0.1066449 0.1030535 0.01018518 29.6947 2.329966e-23 5 3170.39
AIC BIC deviance df.residual
1 -6328.779 -6299.333 0.1032191 995
glance(modelA_rcs) r.squared adj.r.squared sigma statistic p.value df logLik
1 0.1789021 0.1697603 0.009799119 19.56972 6.025097e-36 12 3212.561
AIC BIC deviance df.residual
1 -6399.121 -6335.32 0.09487045 988
| Model | R2 | Adjusted R2 | AIC | BIC | degrees of freedom |
|---|---|---|---|---|---|
| A with RCS | 0.179 | 0.170 | -6399.1 | -6335.3 | 12 |
| B | 0.107 | 0.103 | -6328.8 | -6299.3 | 5 |
I then proceeded with cross-validation of each model to evaluate their performances further.
# Model A (with restricted cubic spline)
set.seed(432)
b12_modelA <- b12_final %>%
crossv_kfold(k = 10) %>%
mutate(model = map(train, ~ lm(b12_trans ~ rcs(supp_b12, 4) + age + smk100 + bmi_cat + etoh + antacid + diet_b12, data = .)))
modelA_predictions <- b12_modelA %>%
unnest(map2(model, test, ~ augment(.x, newdata = .y)))
head(modelA_predictions)# A tibble: 6 x 14
.id id age vit_b12 b12def smk100 bmi_cat etoh antacid supp_b12
<chr> <int> <int> <int> <int> <int> <fct> <int> <int> <dbl>
1 01 73564 61 668 0 0 Obese 0 1 40.0
2 01 73622 72 378 0 0 Overweig… 1 0 4.00
3 01 73690 77 818 0 0 Normal 0 0 2.50
4 01 73708 69 636 0 1 Overweig… 1 1 15.0
5 01 73720 48 565 0 0 Obese 1 0 6.00
6 01 73820 56 484 0 0 Overweig… 1 0 6.00
# ... with 4 more variables: diet_b12 <dbl>, b12_trans <dbl>,
# .fitted <dbl>, .se.fit <dbl>
modelA_predictions %>%
summarize(RMSE_modelA = sqrt(mean((b12_trans - .fitted) ^2)),
MAE_modelA = mean(abs(b12_trans - .fitted)))# A tibble: 1 x 2
RMSE_modelA MAE_modelA
<dbl> <dbl>
1 1.90 0.490
# Graph cross validated prediction errors for model 2
modelA_predictions %>%
mutate(errors = b12_trans - .fitted) %>%
ggplot(., aes(x = errors)) +
geom_histogram(bins = 30, fill = "darkviolet", col = "yellow") +
labs(title = "Cross-Validated Errors in Prediction of 1/sqrt(vitamin B12)",
subtitle = "Using model A",
x = "Error in predicting transformed vitamin B12")# Model B (from Best subsets)
set.seed(432)
b12_modelB <- b12_final %>%
crossv_kfold(k = 10) %>%
mutate(model = map(train, ~ lm(b12_trans ~ smk100 + supp_b12 + etoh + age, data = .)))
modelB_predictions <- b12_modelB %>%
unnest(map2(model, test, ~ augment(.x, newdata = .y)))
head(modelB_predictions)# A tibble: 6 x 14
.id id age vit_b12 b12def smk100 bmi_cat etoh antacid supp_b12
<chr> <int> <int> <int> <int> <int> <fct> <int> <int> <dbl>
1 01 73564 61 668 0 0 Obese 0 1 40.0
2 01 73622 72 378 0 0 Overweig… 1 0 4.00
3 01 73690 77 818 0 0 Normal 0 0 2.50
4 01 73708 69 636 0 1 Overweig… 1 1 15.0
5 01 73720 48 565 0 0 Obese 1 0 6.00
6 01 73820 56 484 0 0 Overweig… 1 0 6.00
# ... with 4 more variables: diet_b12 <dbl>, b12_trans <dbl>,
# .fitted <dbl>, .se.fit <dbl>
modelB_predictions %>%
summarize(RMSE_modelB = sqrt(mean((b12_trans - .fitted) ^2)),
MAE_modelB = mean(abs(b12_trans - .fitted)))# A tibble: 1 x 2
RMSE_modelB MAE_modelB
<dbl> <dbl>
1 0.0103 0.00779
# Graph cross validated prediction errors for model 2
modelB_predictions %>%
mutate(errors = b12_trans - .fitted) %>%
ggplot(., aes(x = errors)) +
geom_histogram(bins = 30, fill = "darkviolet", col = "yellow") +
labs(title = "Cross-Validated Errors in Prediction of 1/sqrt(vitamin B12)",
subtitle = "Using model B",
x = "Error in predicting transformed vitamin B12")10-fold Cross-validation of models A and B revealed that Model B had lower root mean squared error and mean absolute error, as described in the table below:
| Model | R2 | Adjusted R2 | AIC | BIC | df | RMSE | MAE |
|---|---|---|---|---|---|---|---|
| A with RCS | 0.179 | 0.170 | -6399.1 | -6335.3 | 12 | 1.90 | 0.49 |
| B | 0.107 | 0.103 | -6328.8 | -6299.3 | 5 | 0.01 | 0.01 |
Given the fact that model A performed better in sample but worse upon cross validation, with using several more degrees of freedom, this was concerning for overfitting the data. Model B had much smaller RMSE and MAE compared to Model A. Therefore, I chose model B from the best subsets analysis as my final model.
10.5 Validation and calibration of final model
Next, to further validate my chosen model, I created an OLS model to calculate validated summary statistics and calibrate via residual analysis. The validated R2 statistic is 0.085, which is decreased from an already poor 0.107. The mean squared error is again excellent at 0.0001 upon validation.
ANOVA comparisons demonstrate that supplemental B12, alcohol, and age have statistically significant predictive value for the transformed B12 level. Smoking, which is the primary outcome of interest, does not appear to have a statistically significant effect on vitamin B12 level once the other covariates have been adjusted for. These findings are also supported by ANOVA plots and summary plots for the OLS model B.
summary(modelB)
Call:
lm(formula = b12_trans ~ smk100 + supp_b12 + etoh + age, data = b12_final)
Residuals:
Min 1Q Median 3Q Max
-0.030939 -0.006692 -0.000986 0.005776 0.043631
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.252e-02 1.847e-03 23.021 < 2e-16 ***
smk100 -8.672e-04 6.809e-04 -1.274 0.20307
supp_b12 -3.996e-06 3.979e-07 -10.043 < 2e-16 ***
etoh 2.245e-03 7.733e-04 2.903 0.00378 **
age -5.479e-05 2.754e-05 -1.990 0.04688 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01019 on 995 degrees of freedom
Multiple R-squared: 0.1066, Adjusted R-squared: 0.1031
F-statistic: 29.69 on 4 and 995 DF, p-value: < 2.2e-16
# Creation of OLS model to calculate validated summary statistics
d <- datadist(b12_final)
options(datadist="d")
modelB_ols <- ols(b12_trans ~ smk100 + supp_b12 + etoh + age, data = b12_final,x = TRUE, y = TRUE)
modelB_olsLinear Regression Model
ols(formula = b12_trans ~ smk100 + supp_b12 + etoh + age, data = b12_final,
x = TRUE, y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 1000 LR chi2 112.77 R2 0.107
sigma0.0102 d.f. 4 R2 adj 0.103
d.f. 995 Pr(> chi2) 0.0000 g 0.003
Residuals
Min 1Q Median 3Q Max
-0.0309389 -0.0066916 -0.0009861 0.0057756 0.0436311
Coef S.E. t Pr(>|t|)
Intercept 0.0425 0.0018 23.02 <0.0001
smk100 -0.0009 0.0007 -1.27 0.2031
supp_b12 0.0000 0.0000 -10.04 <0.0001
etoh 0.0022 0.0008 2.90 0.0038
age -0.0001 0.0000 -1.99 0.0469
# Validated summary statistics
set.seed(432123)
validate(modelB_ols) index.orig training test optimism index.corrected n
R-square 0.1066 0.1162 0.0949 0.0213 0.0853 40
MSE 0.0001 0.0001 0.0001 0.0000 0.0001 40
g 0.0028 0.0029 0.0028 0.0002 0.0026 40
Intercept 0.0000 0.0000 0.0007 -0.0007 0.0007 40
Slope 1.0000 1.0000 0.9830 0.0170 0.9830 40
# Plotting of coefficient effects
anova(modelB_ols) Analysis of Variance Response: b12_trans
Factor d.f. Partial SS MS F P
smk100 1 0.0001682933 0.0001682933 1.62 0.2031
supp_b12 1 0.0104637741 0.0104637741 100.87 <.0001
etoh 1 0.0008740486 0.0008740486 8.43 0.0038
age 1 0.0004107388 0.0004107388 3.96 0.0469
REGRESSION 4 0.0123218522 0.0030804631 29.69 <.0001
ERROR 995 0.1032191204 0.0001037378
plot(anova(modelB_ols))10.5.1 Calibration via evaluation of residual plots
The calibration via evaluation of residual versus fitted values plot below demonstrates that the there are issues with outliers in the model, particularly observation #578. The values are clustered around the mean transformed B12 value of 0.04, and appear to take a “fuzzy football” shape, but the outlier actually has a negative transformed B12 level, which is not a realistic/feasible value. The Residuals versus leverage plot indicates that this observation has both high leverage and significantly increased Cook’s distance.
# Looking at residual plots for calibration of model
plot(modelB, which=1)plot(modelB, which=2)plot(modelB, which=3)plot(modelB, which=5)10.5.2 Investigation of outlier, with sensitivity analysis
I further investigated observation #578, and it appears that this subject has an extremely high level of supplemental B12 (15100 mcg), which is likely the cause of his or her low predicted transformed B12 level in this model. I therefore dropped observation 578 and reran the model to perform a sensitivity analysis:
modelB.no578 <- lm(b12_trans ~ smk100 + supp_b12 + etoh + age, data=b12_final[-578,])
summary(modelB.no578)
Call:
lm(formula = b12_trans ~ smk100 + supp_b12 + etoh + age, data = b12_final[-578,
])
Residuals:
Min 1Q Median 3Q Max
-0.031330 -0.006485 -0.001148 0.005849 0.040683
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.294e-02 1.824e-03 23.545 <2e-16 ***
smk100 -7.309e-04 6.721e-04 -1.087 0.2771
supp_b12 -5.483e-06 4.814e-07 -11.391 <2e-16 ***
etoh 2.067e-03 7.635e-04 2.707 0.0069 **
age -5.535e-05 2.716e-05 -2.037 0.0419 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01005 on 994 degrees of freedom
Multiple R-squared: 0.1298, Adjusted R-squared: 0.1263
F-statistic: 37.05 on 4 and 994 DF, p-value: < 2.2e-16
# Reexamine residual plots
plot(modelB.no578, which=1)plot(modelB.no578, which=2)plot(modelB.no578, which=3)plot(modelB.no578, which=5)The new model without observation 578 appears much more reasonable and does not seriously violate any assumptions of linear regression. There are no longer issues with high leverage/influence points or non-constant variance. I will therefore present my final model with validated summary statistics without observation 578.
10.6 Final Model Summary
# Model with lm
summary(modelB.no578)
Call:
lm(formula = b12_trans ~ smk100 + supp_b12 + etoh + age, data = b12_final[-578,
])
Residuals:
Min 1Q Median 3Q Max
-0.031330 -0.006485 -0.001148 0.005849 0.040683
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.294e-02 1.824e-03 23.545 <2e-16 ***
smk100 -7.309e-04 6.721e-04 -1.087 0.2771
supp_b12 -5.483e-06 4.814e-07 -11.391 <2e-16 ***
etoh 2.067e-03 7.635e-04 2.707 0.0069 **
age -5.535e-05 2.716e-05 -2.037 0.0419 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01005 on 994 degrees of freedom
Multiple R-squared: 0.1298, Adjusted R-squared: 0.1263
F-statistic: 37.05 on 4 and 994 DF, p-value: < 2.2e-16
confint(modelB.no578) 2.5 % 97.5 %
(Intercept) 3.936036e-02 4.651790e-02
smk100 -2.049928e-03 5.880288e-04
supp_b12 -6.427669e-06 -4.538454e-06
etoh 5.688311e-04 3.565407e-03
age -1.086522e-04 -2.039787e-06
coef(modelB.no578) (Intercept) smk100 supp_b12 etoh age
4.293913e-02 -7.309498e-04 -5.483061e-06 2.067119e-03 -5.534598e-05
# Backtransforming to original B12 scale
1/(coef(modelB.no578)^2) (Intercept) smk100 supp_b12 etoh age
5.423673e+02 1.871651e+06 3.326242e+10 2.340287e+05 3.264584e+08
# Creation of OLS model without #578 to demonstrate nomogram and effect sizes
d <- datadist(b12_final)
options(datadist="d")
modelB.no578.ols <- ols(b12_trans ~ smk100 + supp_b12 + etoh + age, data=b12_final[-578,])The final model (using lm) is the following:
1/sqrt(vitamin B12 level) = 4.29e-02 - 7.31e-04(smk100) - 5.48e-06(supp_b12) + 2.07e-03(etoh) - 5.54e-05(age)
The model has an adjusted R2 of 0.1263, indicating that 12.6% of the variance in transformed B12 level is explained by the model. The ANOVA F-test p-value is <2.2e-16, indicating that the model has statistically significant predictive value.
The model can be interpreted as follows (each statement explains coefficient assuming other predictors are held constant, significance level = 0.05):
Those who smoked at least 100 cigarettes in their life, on average had a non-significant decrease in transformed vitamin B12 level of 7.31e-04 pg/ml (95% CI for decrease 2.05e-03 to -5.88e-04), compared to those who did not.
A 1 mcg increase in supplemental B12 intake resulted in a statistically significant decrease in transformed vitamin B12 level of 5.48e-06 pg/ml (95% CI for decrease 4.54e-06 to 6.43e-06).
Those who had at least 12 alcoholic beverages in the past year, on average had a statistically significant increase in transformed vitamin B12 level of 2.07e-03 pg/ml (95% CI 3.57e-03 to 5.69e-04)
A 1 year increase in age resulted in a statistically significant decrease in transformed vitamin B12 level of 1.09e-04 pg/ml (95% CI for decrease 1.09e-04 to 2.04e-06).
10.7 Predictions using final model
Next, I attempted to use my model to predict transformed vitamin B12 level for potential new subjects.I will use the model to create predictions where supplemental B12 can vary from 0 to 5000 and smoking can be either 0 or 1, with etoh = 0 and age = 61. The results are plotted below. It appears that at higher levels of supplemental B12, the standard errors for the model predictions are slightly higher.
# Tibble to demonstrate predictions
predictions <- Predict(modelB.no578.ols, supp_b12 = seq(0,5000), smk100 = c(0,1))
tbl_df(predictions)# A tibble: 10,002 x 7
smk100 supp_b12 etoh age yhat lower upper
<dbl> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 0 0 0 61.0 0.0396 0.0383 0.0408
2 1.00 0 0 61.0 0.0388 0.0372 0.0405
3 0 1 0 61.0 0.0396 0.0383 0.0408
4 1.00 1 0 61.0 0.0388 0.0372 0.0405
5 0 2 0 61.0 0.0396 0.0383 0.0408
6 1.00 2 0 61.0 0.0388 0.0372 0.0405
7 0 3 0 61.0 0.0395 0.0383 0.0408
8 1.00 3 0 61.0 0.0388 0.0372 0.0405
9 0 4 0 61.0 0.0395 0.0383 0.0408
10 1.00 4 0 61.0 0.0388 0.0372 0.0404
# ... with 9,992 more rows
# Creating graph
ggplot(Predict(modelB.no578.ols, supp_b12 = 0:5000, smk100 = c(0,1)))# Augment to calculate fitted values and backtransform to original B12 scale
b12_aug <- augment(modelB.no578)
b12_aug$fit_trans <- (1/(b12_aug$.fitted^2))
b12_aug$orig_b12 <- (1/(b12_aug$b12_trans^2))I then used a nomogram to accomplish predictions for an individual subject with smk100 = 1, supp_b12 = 2500, etoh = 0, and age = 45:
smk100 = 1 –> 0 points
supp_b12 = 2500 –> 72 points
etoh = 0 —> 0 points
age = 45 —> 2 points (nomogram too small to tell exactly here)
total points = 74, which leads to a predicted transformed B12 level of 0.016, which translates to a B12 level of 3906.3 pg/ml.
plot(nomogram(modelB.no578.ols))11 Task 11: An Analysis: Logistic Regression
11.1 Exploratory Data Analysis
A Spearman \(\rho^2\) plot indicates that supplemental and dietary B12, as well as smoking, are the strongest predictors for vitamin B12 deficiency. Supplemental and dietary B12 are important confounders of this relationship that will be important to account for in this analysis.
plot(spearman2(b12def ~ age + smk100 + bmi_cat + etoh + antacid + supp_b12 + diet_b12, data = b12_final))11.2 Model 1 - evaluating nonlinear terms
I will begin with model 1, a logistic regression model using LRM that will include the following nonlinear terms, in addition to the main effects of the other possible predictors.
- Restricted cubic splines with 3 knots for supp_b12
- Restricted cubic splines with 3 knots for diet_b12
- An interaction term for smoking and BMI
skim(b12_final)Skim summary statistics
n obs: 1000
n variables: 11
Variable type: factor
variable missing complete n n_unique
bmi_cat 0 1000 1000 4
top_counts ordered
Obe: 372, Ove: 361, Nor: 254, Und: 13 FALSE
Variable type: integer
variable missing complete n mean sd p0 p25 median
age 0 1000 1000 61.01 11.87 41 51 61
antacid 0 1000 1000 0.14 0.35 0 0 0
b12def 0 1000 1000 0.048 0.21 0 0 0
etoh 0 1000 1000 0.74 0.44 0 0 1
id 0 1000 1000 78566.26 2937.18 73559 75995.25 78582
smk100 0 1000 1000 0.45 0.5 0 0 0
vit_b12 0 1000 1000 875.15 943.18 154 481 669
p75 p100 hist
70 80 ▆▅▆▇▇▆▅▇
0 1 ▇▁▁▁▁▁▁▁
0 1 ▇▁▁▁▁▁▁▁
1 1 ▃▁▁▁▁▁▁▇
81011.5 83721 ▇▇▇▇▇▇▇▇
1 1 ▇▁▁▁▁▁▁▆
933.25 12494 ▇▁▁▁▁▁▁▁
Variable type: numeric
variable missing complete n mean sd p0 p25 median
b12_trans 0 1000 1000 0.039 0.011 0.0089 0.033 0.039
diet_b12 0 1000 1000 4.76 4.11 0.16 2.49 3.8
supp_b12 0 1000 1000 256.76 811.31 0.16 6 25
p75 p100 hist
0.046 0.081 ▁▂▆▇▅▁▁▁
5.86 54.41 ▇▂▁▁▁▁▁▁
75 15100 ▇▁▁▁▁▁▁▁
# Creating model 1 with nonlinear terms
dd <- datadist(b12_final)
options(datadist = "dd")
model_01 <- lrm(b12def ~ age + smk100 + bmi_cat + etoh + antacid + rcs(supp_b12,3) + rcs(diet_b12,3) + smk100 %ia% bmi_cat, data = b12_final, x = TRUE, y = TRUE)
model_01Logistic Regression Model
lrm(formula = b12def ~ age + smk100 + bmi_cat + etoh + antacid +
rcs(supp_b12, 3) + rcs(diet_b12, 3) + smk100 %ia% bmi_cat,
data = b12_final, x = TRUE, y = TRUE)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 1000 LR chi2 21.09 R2 0.065 C 0.697
0 952 d.f. 14 g 0.889 Dxy 0.394
1 48 Pr(> chi2) 0.0993 gr 2.432 gamma 0.394
max |deriv| 1 gp 0.034 tau-a 0.036
Brier 0.045
Coef S.E. Wald Z Pr(>|Z|)
Intercept -1.9845 1.0060 -1.97 0.0485
age -0.0061 0.0131 -0.46 0.6429
smk100 0.7837 0.5560 1.41 0.1586
bmi_cat=Obese 0.4633 0.5280 0.88 0.3802
bmi_cat=Overweight -0.6434 0.6585 -0.98 0.3286
bmi_cat=Underweight -4.4729 21.6512 -0.21 0.8363
etoh 0.3036 0.3948 0.77 0.4419
antacid -0.3077 0.4884 -0.63 0.5287
supp_b12 -0.0036 0.0023 -1.54 0.1233
supp_b12' 0.0547 0.0412 1.33 0.1839
diet_b12 -0.2703 0.1325 -2.04 0.0414
diet_b12' 0.2724 0.1751 1.56 0.1198
smk100 * bmi_cat=Obese -0.7407 0.7159 -1.03 0.3009
smk100 * bmi_cat=Overweight -0.0654 0.8433 -0.08 0.9382
smk100 * bmi_cat=Underweight 5.2158 21.6812 0.24 0.8099
anova(model_01) Wald Statistics Response: b12def
Factor Chi-Square d.f. P
age 0.22 1 0.6429
smk100 (Factor+Higher Order Factors) 3.24 4 0.5183
All Interactions 1.40 3 0.7056
bmi_cat (Factor+Higher Order Factors) 6.23 6 0.3983
All Interactions 1.40 3 0.7056
etoh 0.59 1 0.4419
antacid 0.40 1 0.5287
supp_b12 4.15 2 0.1257
Nonlinear 1.77 1 0.1839
diet_b12 5.47 2 0.0650
Nonlinear 2.42 1 0.1198
smk100 * bmi_cat (Factor+Higher Order Factors) 1.40 3 0.7056
TOTAL NONLINEAR 4.12 2 0.1272
TOTAL NONLINEAR + INTERACTION 5.61 5 0.3457
TOTAL 18.13 14 0.2010
plot(anova(model_01))Based on this ANOVA and summary statistics above, none of the candidate main effects or the nonlinear terms appear to have a statistically significant impact in predicting vitamin B12 deficiency. Therefore, I will not use nonlinear predictors in this analysis.
11.3 Model 2 - Kitchen sink model with main effects only, evaluate variable selection with stepwise regression
# LRM fit
dd <- datadist(b12_final)
options(datadist = "dd")
model_ks_lrm <- lrm(b12def ~ age + smk100 + bmi_cat + etoh + antacid + supp_b12 + diet_b12, data = b12_final, x= T, y = T)
model_ks_lrmLogistic Regression Model
lrm(formula = b12def ~ age + smk100 + bmi_cat + etoh + antacid +
supp_b12 + diet_b12, data = b12_final, x = T, y = T)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 1000 LR chi2 15.34 R2 0.048 C 0.658
0 952 d.f. 9 g 0.804 Dxy 0.316
1 48 Pr(> chi2) 0.0820 gr 2.235 gamma 0.317
max |deriv| 0.005 gp 0.029 tau-a 0.029
Brier 0.045
Coef S.E. Wald Z Pr(>|Z|)
Intercept -2.2676 0.9234 -2.46 0.0141
age -0.0065 0.0130 -0.50 0.6189
smk100 0.4509 0.3180 1.42 0.1562
bmi_cat=Obese 0.0281 0.3537 0.08 0.9366
bmi_cat=Overweight -0.6699 0.4081 -1.64 0.1007
bmi_cat=Underweight 0.4213 1.0900 0.39 0.6991
etoh 0.2607 0.3909 0.67 0.5048
antacid -0.3574 0.4866 -0.73 0.4627
supp_b12 -0.0007 0.0005 -1.54 0.1244
diet_b12 -0.0966 0.0579 -1.67 0.0952
# GLM fit for stepwise regression
model_ks_glm <- glm(b12def ~ age + smk100 + bmi_cat + etoh + antacid + supp_b12 + diet_b12, data = b12_final, family = binomial)
summary(model_ks_glm)
Call:
glm(formula = b12def ~ age + smk100 + bmi_cat + etoh + antacid +
supp_b12 + diet_b12, family = binomial, data = b12_final)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.5208 -0.3592 -0.2938 -0.2236 2.8189
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.2675525 0.9233691 -2.456 0.0141 *
age -0.0064595 0.0129853 -0.497 0.6189
smk100 0.4509053 0.3180127 1.418 0.1562
bmi_catObese 0.0281224 0.3536579 0.080 0.9366
bmi_catOverweight -0.6698961 0.4080915 -1.642 0.1007
bmi_catUnderweight 0.4213010 1.0900244 0.387 0.6991
etoh 0.2606947 0.3908961 0.667 0.5048
antacid -0.3573551 0.4865757 -0.734 0.4627
supp_b12 -0.0007386 0.0004806 -1.537 0.1244
diet_b12 -0.0965625 0.0578693 -1.669 0.0952 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 385.17 on 999 degrees of freedom
Residual deviance: 369.83 on 990 degrees of freedom
AIC: 389.83
Number of Fisher Scoring iterations: 7
step(model_ks_glm)Start: AIC=389.83
b12def ~ age + smk100 + bmi_cat + etoh + antacid + supp_b12 +
diet_b12
Df Deviance AIC
- age 1 370.07 388.07
- bmi_cat 3 374.27 388.27
- etoh 1 370.29 388.29
- antacid 1 370.41 388.41
<none> 369.83 389.83
- smk100 1 371.86 389.86
- diet_b12 1 373.37 391.37
- supp_b12 1 373.82 391.82
Step: AIC=388.07
b12def ~ smk100 + bmi_cat + etoh + antacid + supp_b12 + diet_b12
Df Deviance AIC
- bmi_cat 3 374.57 386.57
- etoh 1 370.67 386.67
- antacid 1 370.68 386.68
- smk100 1 371.98 387.98
<none> 370.07 388.07
- diet_b12 1 373.64 389.64
- supp_b12 1 374.23 390.23
Step: AIC=386.57
b12def ~ smk100 + etoh + antacid + supp_b12 + diet_b12
Df Deviance AIC
- etoh 1 374.94 384.94
- antacid 1 375.18 385.18
<none> 374.57 386.57
- smk100 1 376.62 386.62
- diet_b12 1 378.23 388.23
- supp_b12 1 378.32 388.32
Step: AIC=384.94
b12def ~ smk100 + antacid + supp_b12 + diet_b12
Df Deviance AIC
- antacid 1 375.50 383.50
<none> 374.94 384.94
- smk100 1 377.84 385.84
- diet_b12 1 378.45 386.45
- supp_b12 1 378.72 386.72
Step: AIC=383.5
b12def ~ smk100 + supp_b12 + diet_b12
Df Deviance AIC
<none> 375.50 383.50
- smk100 1 378.44 384.44
- diet_b12 1 379.11 385.11
- supp_b12 1 379.21 385.21
Call: glm(formula = b12def ~ smk100 + supp_b12 + diet_b12, family = binomial,
data = b12_final)
Coefficients:
(Intercept) smk100 supp_b12 diet_b12
-2.7169067 0.5116181 -0.0007194 -0.0964161
Degrees of Freedom: 999 Total (i.e. Null); 996 Residual
Null Deviance: 385.2
Residual Deviance: 375.5 AIC: 383.5
The model indicated by backwards stepwise elimination is the following:
b12def ~ smk100 + supp_b12 + diet_b12
Therefore, I will compare this model (model 3) to the kitchen sink model via various tests below.
dd <- datadist(b12_final)
options(datadist = "dd")
model_03_lrm <- lrm(b12def ~ smk100 + supp_b12 + diet_b12, data=b12_final, x = TRUE, y = TRUE)
model_03_glm <- glm(b12def ~ smk100 + supp_b12 + diet_b12, data=b12_final, family = binomial)11.4 Comparing kitchen sink model to model 3 (smoking, supplementary and dietary B12)
11.4.1 ANOVA
Based on the output below, it does not appear that model 3 improves upon the kitchen sink model via ANOVA.
anova(model_ks_glm, model_03_glm)Analysis of Deviance Table
Model 1: b12def ~ age + smk100 + bmi_cat + etoh + antacid + supp_b12 +
diet_b12
Model 2: b12def ~ smk100 + supp_b12 + diet_b12
Resid. Df Resid. Dev Df Deviance
1 990 369.83
2 996 375.50 -6 -5.6701
pchisq(5.6701,6, lower.tail=FALSE)[1] 0.4611408
11.4.2 AIC and BIC
On AIC/BIC evaluation, Model 3 has lower AIC and BIC than the kitchen sink model.
glance(model_ks_glm) null.deviance df.null logLik AIC BIC deviance df.residual
1 385.1674 999 -184.9134 389.8268 438.9044 369.8268 990
glance(model_03_glm) null.deviance df.null logLik AIC BIC deviance df.residual
1 385.1674 999 -187.7485 383.497 403.128 375.497 996
11.4.3 ROC Curves/C-statistic
# ROC curve for kitchen sink model (AUC = 0.658)
roc_model_ks <- roc(b12_final$b12def ~ predict(model_ks_glm, type = "response"), ci = TRUE)
roc_model_ks
Call:
roc.formula(formula = b12_final$b12def ~ predict(model_ks_glm, type = "response"), ci = TRUE)
Data: predict(model_ks_glm, type = "response") in 952 controls (b12_final$b12def 0) < 48 cases (b12_final$b12def 1).
Area under the curve: 0.6579
95% CI: 0.5856-0.7303 (DeLong)
plot(roc_model_ks)# ROC curve for model 3 (AUC = 0.616)
roc_model_03<- roc(b12_final$b12def ~ predict(model_03_glm, type = "response"), ci = TRUE)
roc_model_03
Call:
roc.formula(formula = b12_final$b12def ~ predict(model_03_glm, type = "response"), ci = TRUE)
Data: predict(model_03_glm, type = "response") in 952 controls (b12_final$b12def 0) < 48 cases (b12_final$b12def 1).
Area under the curve: 0.6155
95% CI: 0.5389-0.6921 (DeLong)
plot(roc_model_03)11.4.4 Nagelkerke’s R-squared
Based on the LRM output, Nagelkerke’s R2 is 0.048 for the kitchen sink model, and 0.030 for model 3. Neither of the values are particularly good, but the kitchen sink model appears to be better in this regard.
model_ks_lrmLogistic Regression Model
lrm(formula = b12def ~ age + smk100 + bmi_cat + etoh + antacid +
supp_b12 + diet_b12, data = b12_final, x = T, y = T)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 1000 LR chi2 15.34 R2 0.048 C 0.658
0 952 d.f. 9 g 0.804 Dxy 0.316
1 48 Pr(> chi2) 0.0820 gr 2.235 gamma 0.317
max |deriv| 0.005 gp 0.029 tau-a 0.029
Brier 0.045
Coef S.E. Wald Z Pr(>|Z|)
Intercept -2.2676 0.9234 -2.46 0.0141
age -0.0065 0.0130 -0.50 0.6189
smk100 0.4509 0.3180 1.42 0.1562
bmi_cat=Obese 0.0281 0.3537 0.08 0.9366
bmi_cat=Overweight -0.6699 0.4081 -1.64 0.1007
bmi_cat=Underweight 0.4213 1.0900 0.39 0.6991
etoh 0.2607 0.3909 0.67 0.5048
antacid -0.3574 0.4866 -0.73 0.4627
supp_b12 -0.0007 0.0005 -1.54 0.1244
diet_b12 -0.0966 0.0579 -1.67 0.0952
model_03_lrmLogistic Regression Model
lrm(formula = b12def ~ smk100 + supp_b12 + diet_b12, data = b12_final,
x = TRUE, y = TRUE)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 1000 LR chi2 9.67 R2 0.030 C 0.616
0 952 d.f. 3 g 0.642 Dxy 0.231
1 48 Pr(> chi2) 0.0216 gr 1.901 gamma 0.232
max |deriv| 0.002 gp 0.023 tau-a 0.021
Brier 0.045
Coef S.E. Wald Z Pr(>|Z|)
Intercept -2.7169 0.3166 -8.58 <0.0001
smk100 0.5116 0.2995 1.71 0.0876
supp_b12 -0.0007 0.0005 -1.49 0.1362
diet_b12 -0.0964 0.0572 -1.69 0.0916
11.4.5 Validating summary statistics
For the kitchen sink model, the corrected Nagelkerke’s R2 is 0.001, which is worse than the original estimate of 0.048. The corrected C-statistic is 0.596, which is decreased from the original value of 0.658.
For model 3, the corrected Nagelkerke’s R2 is 0.005, which is again much worse than an already poor estimate of 0.030. The corrected C-statistic is 0.580, which is again less than the original value of 0.616.
However, model 3 marginally outperforms the kitchen sink model here.
validate(model_ks_lrm) index.orig training test optimism index.corrected n
Dxy 0.3158 0.3863 0.2623 0.1239 0.1919 40
R2 0.0476 0.0744 0.0277 0.0466 0.0010 40
Intercept 0.0000 0.0000 -1.3536 1.3536 -1.3536 40
Slope 1.0000 1.0000 0.5285 0.4715 0.5285 40
Emax 0.0000 0.0000 0.4575 0.4575 0.4575 40
D 0.0143 0.0230 0.0079 0.0151 -0.0007 40
U -0.0020 -0.0020 0.0059 -0.0079 0.0059 40
Q 0.0163 0.0250 0.0020 0.0230 -0.0066 40
B 0.0451 0.0446 0.0455 -0.0009 0.0460 40
g 0.8044 1.1368 0.5670 0.5698 0.2346 40
gp 0.0294 0.0358 0.0212 0.0145 0.0149 40
validate(model_03_lrm) index.orig training test optimism index.corrected n
Dxy 0.2310 0.3093 0.2283 0.0810 0.1500 40
R2 0.0301 0.0501 0.0247 0.0254 0.0047 40
Intercept 0.0000 0.0000 -0.8609 0.8609 -0.8609 40
Slope 1.0000 1.0000 0.7065 0.2935 0.7065 40
Emax 0.0000 0.0000 0.2718 0.2718 0.2718 40
D 0.0087 0.0152 0.0069 0.0082 0.0004 40
U -0.0020 -0.0020 0.0101 -0.0121 0.0101 40
Q 0.0107 0.0172 -0.0032 0.0203 -0.0097 40
B 0.0453 0.0461 0.0456 0.0006 0.0448 40
g 0.6424 1.3365 0.5654 0.7710 -0.1286 40
gp 0.0227 0.0285 0.0197 0.0088 0.0139 40
\[ C = 0.5 + \frac{Dxy}{2} = 0.5 + \frac{.1917}{2} = 0.5 + 0.0959 = 0.596 \] \[ C = 0.5 + \frac{Dxy}{2} = 0.5 + \frac{.1601}{2} = 0.5 + 0.0801 = 0.580 \]
11.4.6 Comparing calibration of models
Based on the calibration plots below, neither model appears to be well calibrated in predicting vitamin B12 deficiency. It appears that model 3 may again marginally outperform the kitchen sink model here, as the bias-corrected line stays closer to the ideal line.
plot(calibrate(model_ks_lrm))
n=1000 Mean absolute error=0.007 Mean squared error=1e-04
0.9 Quantile of absolute error=0.009
plot(calibrate(model_03_lrm))
n=1000 Mean absolute error=0.005 Mean squared error=4e-05
0.9 Quantile of absolute error=0.01
11.4.7 Choosing final model
| Model | C-statistic | Nagelkerke’s R2 | AIC | BIC | Brier score |
|---|---|---|---|---|---|
| Kitchen sink | 0.658 | 0.048 | 389.8 | 438.9 | 0.045 |
| Model 3 | 0.616 | 0.030 | 383.5 | 403.1 | 0.045 |
Based on the summary of comparision information above, I choose model 3 as my final model. Although AIC and BIC were lower for model 3, I felt its performance in terms of discrimination and calibration were consistently, though marginally, superior to the kitchen sink model.
11.5 Validating and Calibrating Final Model (model 3)
The validation below indicates a corrected C-statistic of 0.586 and Nagelkerke’s R-squared of 0.0128. These measures both indicate overall poor discriminatory ability.
set.seed(432321)
validate(model_03_lrm, B = 100) index.orig training test optimism index.corrected n
Dxy 0.2310 0.2873 0.2287 0.0586 0.1724 100
R2 0.0301 0.0420 0.0247 0.0173 0.0128 100
Intercept 0.0000 0.0000 -0.6975 0.6975 -0.6975 100
Slope 1.0000 1.0000 0.7565 0.2435 0.7565 100
Emax 0.0000 0.0000 0.2167 0.2167 0.2167 100
D 0.0087 0.0126 0.0069 0.0057 0.0030 100
U -0.0020 -0.0020 0.0026 -0.0046 0.0026 100
Q 0.0107 0.0146 0.0043 0.0103 0.0003 100
B 0.0453 0.0454 0.0455 -0.0001 0.0454 100
g 0.6424 0.9133 0.5625 0.3508 0.2916 100
gp 0.0227 0.0261 0.0199 0.0063 0.0165 100
plot(calibrate(model_03_lrm))
n=1000 Mean absolute error=0.007 Mean squared error=5e-05
0.9 Quantile of absolute error=0.009
As demonstrated previously, the calibration of the final model is also suboptimal. The bias-corrected line deviates from the ideal line, and whether it is over or under depends upon the value of the predicted probability. Based on this information, I would use caution in trusting the predictions made by this model.
inf <- which.influence(model_03_lrm, cutoff = 0.3)
show.influence(object = inf, dframe = data.frame(b12_final)) Count supp_b12 diet_b12
111 1 *1400 1.215
190 1 *1000 2.015
232 1 *1000 1.885
490 1 *1000 1.475
984 1 12 *11.635
plot(model_03_glm, which=5)In evaluating influence, there appear to be 4 values of supp_b12 and 1 value of diet_b12 that are particularly influential on the model. However, none of these have a Cook’s distance >0.5.
11.6 Summarizing Final Model
The final model, model 3, is as follows:
Log odds of vitamin B12 deficiency = -2.72 + 0.51(smk100) - 0.0007(supp_b12) - 0.096(diet_b12)
The odds ratios from the final model, as indicated by the GLM model, are as follows. All odds ratios are reported assuming the other covariates in the model are held constant.
Compared to those who did not smoke at least 100 cigarettes in their lifetime, those who smoked at least 100 cigarettes had 1.668 times greater the odds (95% CI 0.93 to 3.03) of having B12 deficiency. However, this result was not statistically significant.
The odds ratio of B12 deficiency associated with a 1 mcg increase in supplementary B12 was 0.999 (95% CI 0.998 to 1.00), and this was not statistically significant.
The odds ratio of B12 deficiency associated with a 1 mcg increase in dietary B12 was 0.908 (95% CI 0.80 to 1.00), but this was not statistically significant
Based on the LRM model, which is also reflected in the graph below:
Compared to those who did not smoke at least 100 cigarettes in their lifetime, those who smoked at least 100 cigarettes had 1.668 times greater the odds (95% CI 0.93 to 3.00) of having B12 deficiency. However, this result was not statistically significant
The odds ratio of B12 deficiency associated with an increase from the 25th percentile (6.0 mcg) to 75th percentile (75.0 mcg) of supplementary B12 was 0.95 (95% CI 0.89 to 1.02), and this was not statistically significant. In other words, increasing supplementary B12 intake was associated with a non-significant reduction in the odds of B12 deficiency.
The odds ratio of B12 deficiency associated with an increase from the 25th percentile (2.49 mcg) to 75th percentile (5.86 mcg) of dietary B12 intake was 0.72 (95% CI 0.50 to 1.05), and this was not statistically significant. In other words, increasing dietary B12 intake was associated with a non-significant reduction in the odds of B12 deficiency.
summary(model_03_glm)
Call:
glm(formula = b12def ~ smk100 + supp_b12 + diet_b12, family = binomial,
data = b12_final)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.4527 -0.3502 -0.3043 -0.2570 2.6855
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.7169067 0.3166447 -8.580 <2e-16 ***
smk100 0.5116181 0.2995068 1.708 0.0876 .
supp_b12 -0.0007194 0.0004827 -1.490 0.1362
diet_b12 -0.0964161 0.0571548 -1.687 0.0916 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 385.17 on 999 degrees of freedom
Residual deviance: 375.50 on 996 degrees of freedom
AIC: 383.5
Number of Fisher Scoring iterations: 7
exp(coef(model_03_glm))(Intercept) smk100 supp_b12 diet_b12
0.06607884 1.66798804 0.99928090 0.90808605
exp(confint(model_03_glm)) 2.5 % 97.5 %
(Intercept) 0.03512361 0.1215716
smk100 0.93014116 3.0307757
supp_b12 0.99808267 1.0000080
diet_b12 0.80364933 1.0024600
summary(model_03_lrm) Effects Response : b12def
Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
smk100 0.0000 1.0000 1.00 0.511620 0.299510 -0.075405 1.098600
Odds Ratio 0.0000 1.0000 1.00 1.668000 NA 0.927370 3.000100
supp_b12 6.0000 75.0000 69.00 -0.049636 0.033309 -0.114920 0.015649
Odds Ratio 6.0000 75.0000 69.00 0.951580 NA 0.891440 1.015800
diet_b12 2.4925 5.8625 3.37 -0.324920 0.192610 -0.702430 0.052590
Odds Ratio 2.4925 5.8625 3.37 0.722580 NA 0.495380 1.054000
plot(summary(model_03_lrm))11.7 Making Predictions - Nomogram
plot(nomogram(model_03_lrm, fun = plogis,
fun.at=c(0.05, seq(0.1, 0.9, by = 0.1), 0.95),
funlabel="Pr(low = 1)"))11.8 Making predictions - Graphs
**** add more here *****
ggplot(Predict(model_03_lrm))# on probability scale
ggplot(Predict(model_03_lrm, fun = plogis))12 Task 12: Discussion/Conclusions
My project was intended to evaluate the effects of smoking on vitamin B12 deficiency, adjusting for known risk factors and confounders of vitamin B12 deficiency (particularly age, antacid use, and dietary/supplementary B12 intake). Linear regression modeling revealed that although smoking was not statistically significantly associated with serum B12 level, supplemental B12 intake, alcohol intake, and age were.
%%% put in directions of associations once transformation figured out %%%
Based upon the logistic regression, none of the hypothesized covariates appeared to have statistically significant effects on B12 deficiency, particularly the main predictor of interest, smoking.
Although the overall sample size was adequate (n = 1000), the study was somewhat limited by a small proportion of patients who had the outcome of interest for the logistic regression model (n = 48), even when the threshold for vitamin B12 deficiency was increased to 300 pg/ml. This issue likely limited my ability to fully evaluate the effects of the predictors of interest. Another issue with this study was that I utilized a complete case analysis, based on the available data from NHANES. Again, although the sample size was adequate, the study was likely susceptible to selection bias which may have resulted in certain patients (most concerning of whom are those with the outcome of interest) being excluded from the analysis. If I had to redo this project, I would have tried to include more patients with the outcome of interest and utilized imputation strategies to have a more representative sample of the NHANES population.